Review of Surrogate Variable Analysis on mean heterogeneity data

In the previous document, we applied Surrogate Variable Analysis (SVA) to the mean heterogeneity data. Our toy examples contain 80 samples with 100 features, a high-dimension, low-sample-size setting.

The samples belong to two classes: the first 40 samples belong to Class 1 and the remaining 40 to Class 2. The samples are collected from two batches, Batch 1 and Batch 2.

Each sample has 100 features:

  1. \(X_1\) is the primary variable effect. For Class 1, \(X_1 = 4\); for Class 2, \(X_1 = -4\).
  2. \(X_2\) is the batch effect. For Batch 1, \(X_2 \sim \mathcal{N}(2, 1)\); for Batch 2, \(X_2 \sim \mathcal{N}(-2, 1)\).
  3. \(X_3, \dots, X_{100} \sim \mathcal{N}(0, 1)\) are random noise.
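The generating process above can be sketched as follows. This is a Python/NumPy sketch of the toy design, not the original simulation code; rows are features and columns are samples, matching the usual sva convention, and the balanced batch assignment shown here is one of several cases studied below.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_samples = 100, 80

# Class labels: first 40 samples in Class 1 (+1), remaining 40 in Class 2 (-1).
y = np.repeat([1.0, -1.0], 40)

# Batch labels (balanced case: each class split evenly between the two batches).
batch = np.tile(np.repeat([1, 2], 20), 2)

# Rows are features, columns are samples.
X = rng.standard_normal((n_features, n_samples))   # X_3..X_100 ~ N(0, 1) noise
X[0] = np.where(y == 1, 4.0, -4.0)                 # X_1: primary effect, +/-4
X[1] = np.where(batch == 1, 2.0, -2.0) + rng.standard_normal(n_samples)  # X_2: batch effect
```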

\(R\) is the residual matrix obtained by regressing the data matrix \(X\) on the primary variable (class-label vector) \(Y\). Since the primary variable effect is constant within each class, the \(X_1\) row of the residual matrix is 0. The problem is then equivalent to detecting the mean heterogeneity in a Gaussian random noise matrix.
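This residualization step can be sketched in Python/NumPy under the toy design above (an illustration, not the sva code); the \(X_1\) row of \(R\) indeed comes out numerically zero:

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.repeat([1.0, -1.0], 40)                     # class labels for 80 samples
batch = np.tile(np.repeat([1, 2], 20), 2)

X = rng.standard_normal((100, 80))                 # features x samples
X[0] = 4.0 * y                                     # X_1: primary effect
X[1] = np.where(batch == 1, 2.0, -2.0) + rng.standard_normal(80)  # X_2: batch effect

# Regress each feature (row of X) on the class label and keep the residuals.
D = np.column_stack([np.ones(80), y])              # design: intercept + class label
beta, *_ = np.linalg.lstsq(D, X.T, rcond=None)     # least squares, one fit per feature
R = X - (D @ beta).T                               # residual matrix

print(np.abs(R[0]).max())   # the X_1 row of R is numerically zero
```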

The sva package provides two approaches to detecting the surrogate variables:

  1. A permutation test based on the algorithm proposed by Buja and Eyuboglu (1992).
  2. An asymptotic approach proposed by Leek (2011).

The simulation results indicate that SVA performs poorly at detecting the surrogate variable in our toy examples. The major problem in the algorithm is that it forms the null matrix \(R^{*}\) by permuting each row of \(R\) independently, with the intent of removing any structure in the matrix. However, this operation does not actually break the Gaussian mixture structure. In \(R\), the rows \(X_3, \dots, X_{100}\) are Gaussian random noise, which is spherically symmetric, so permuting within these rows does not change the distribution of the matrix. Permuting within the \(X_2\) row also preserves the Gaussian mixture structure, since the row retains the same bimodal values in a different order. Therefore, after permuting each row independently, the whole matrix still has a Gaussian mixture structure.

In our toy examples, a more effective way to build the null matrix for detecting the surrogate variable is to permute each column independently.
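A small experiment illustrates the contrast between the two nulls (a Python/NumPy sketch, not the sva implementation): shuffling within rows preserves each row's values, so the bimodal batch row and the leading singular value survive, while shuffling within columns scatters the batch values across features and pulls the leading singular value back toward the noise level.

```python
import numpy as np

rng = np.random.default_rng(1)
batch = np.tile(np.repeat([1, 2], 20), 2)

# Stand-in for R: one bimodal batch-effect row among 99 Gaussian noise rows.
R = rng.standard_normal((100, 80))
R[1] = np.where(batch == 1, 2.0, -2.0) + rng.standard_normal(80)

def top_sv(M):
    return np.linalg.svd(M, compute_uv=False)[0]

def permute_rows(M):   # the sva-style null: shuffle within each row independently
    return np.array([rng.permutation(r) for r in M])

def permute_cols(M):   # the alternative null: shuffle within each column independently
    return np.array([rng.permutation(c) for c in M.T]).T

# Row shuffling keeps every row's values (hence the bimodal batch row) intact,
# so the leading singular value of the null barely moves; column shuffling
# destroys the aligned structure and the leading singular value drops.
row_null = np.mean([top_sv(permute_rows(R)) for _ in range(20)])
col_null = np.mean([top_sv(permute_cols(R)) for _ in range(20)])
print(top_sv(R), row_null, col_null)
```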

Simulation Study

Non Mean Heterogeneity Data

In this case, \(\pi_1 = \pi_2 = 1\), where \(\pi_k\) denotes the proportion of Class \(k\) samples drawn from Batch 1. This means all samples come from Batch 1, so there is no batch effect in the data. This case serves as a baseline for understanding the behavior of the eigenvalues in the analysis.

SVA analysis

## [1] "The number of surrogate variable by asymptotic approach: 0"
## [1] "The number of surrogate variable by permutation test: 0"
## [1] "Mannually setting the number of surrogate variable as 1 to apply the sva algorithm"
## Number of significant surrogate variables is:  1 
## Iteration (out of 5 ):1  2  3  4  5

PCA behavior

Balanced Mean Heterogeneity Data

In this case, \(\pi_1 = \pi_2 = 0.5\). This setting is balanced in two ways:

  • The proportion of samples from Batch 1 is the same in the two classes; equivalently, the batch vector is orthogonal to the primary variable (class indicator vector).
  • Batch 1 and Batch 2 contain the same number of samples.
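The orthogonality claim in the first bullet can be checked directly. The \(\pm 1\) codings below are assumptions of this illustration: when each class contributes the same proportion of Batch 1 samples, the batch indicator is exactly orthogonal to the class indicator.

```python
import numpy as np

y = np.repeat([1.0, -1.0], 40)                    # class indicator, +/-1 coding
# Balanced case: half of each class comes from Batch 1.
batch = np.tile(np.repeat([1.0, -1.0], 20), 2)    # batch indicator, +/-1 coding

# Equal Batch-1 proportions in both classes cancel the inner product exactly.
print(float(y @ batch))   # 0.0
```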

## [1] "The number of surrogate variable by asymptotic approach: 5"
## [1] "The number of surrogate variable by permutation test: 0"
## [1] "Mannually setting the number of surrogate variable as 1 to apply the sva algorithm"
## Number of significant surrogate variables is:  1 
## Iteration (out of 5 ):1  2  3  4  5

PCA behavior

Unbalanced Mean Heterogeneity Data

To better understand the behavior of SVA under unbalanced batch effects, we design three cases.

Case 1: \(\pi_1 = \pi_2, \pi_1 + \pi_2 \neq 1\)

This setting implies:

  • The proportion of samples from Batch 1 is the same in the two classes; equivalently, the batch vector is orthogonal to the primary variable (class indicator vector).
  • Batch 1 and Batch 2 contain different numbers of samples.

By the symmetry between Batch 1 and Batch 2, we only need to consider the case \(\pi_1 = \pi_2, \pi_1 + \pi_2 < 1\). We run simulations for two settings of \(\pi_1\) and \(\pi_2\):

  • \(\pi_1 = 0.4, \pi_2 = 0.4\).
  • \(\pi_1 = 0.1, \pi_2 = 0.1\).

## [1] "The number of surrogate variable by asymptotic approach: 0"
## [1] "The number of surrogate variable by permutation test: 0"
## [1] "Mannually setting the number of surrogate variable as 1 to apply the sva algorithm"
## Number of significant surrogate variables is:  1 
## Iteration (out of 5 ):1  2  3  4  5
## [1] "The number of surrogate variable by asymptotic approach: 0"
## [1] "The number of surrogate variable by permutation test: 0"
## [1] "Mannually setting the number of surrogate variable as 1 to apply the sva algorithm"
## Number of significant surrogate variables is:  1 
## Iteration (out of 5 ):1  2  3  4  5

PCA behavior

Case 2: \(\pi_1 \neq \pi_2, \pi_1 + \pi_2 = 1\)

This setting implies:

  • The proportion of samples from Batch 1 differs between the two classes; equivalently, the batch vector is not orthogonal to the primary variable (class indicator vector).
  • Batch 1 and Batch 2 contain the same number of samples.

By the symmetry between Batch 1 and Batch 2, we only need to consider the case \(\pi_1 > \pi_2, \pi_1 + \pi_2 = 1\). We run simulations for two settings of \(\pi_1\) and \(\pi_2\):

  • \(\pi_1 = 0.6, \pi_2 = 0.4\).
  • \(\pi_1 = 0.9, \pi_2 = 0.1\).

## [1] "The number of surrogate variable by asymptotic approach: 0"
## [1] "The number of surrogate variable by permutation test: 0"
## [1] "Mannually setting the number of surrogate variable as 1 to apply the sva algorithm"
## Number of significant surrogate variables is:  1 
## Iteration (out of 5 ):1  2  3  4  5
## [1] "The number of surrogate variable by asymptotic approach: 5"
## [1] "The number of surrogate variable by permutation test: 0"
## [1] "Mannually setting the number of surrogate variable as 1 to apply the sva algorithm"
## Number of significant surrogate variables is:  1 
## Iteration (out of 5 ):1  2  3  4  5

PCA behavior

Case 3: \(\pi_1 \neq \pi_2, \pi_1 + \pi_2 \neq 1\)

This setting implies:

  • The proportion of samples from Batch 1 differs between the two classes; equivalently, the batch vector is not orthogonal to the primary variable (class indicator vector).
  • Batch 1 and Batch 2 contain different numbers of samples.

By the symmetry between Batch 1 and Batch 2, we only need to consider the case \(\pi_1 > \pi_2, \pi_1 + \pi_2 < 1\). We run simulations for two settings of \(\pi_1\) and \(\pi_2\):

  • \(\pi_1 = 0.5, \pi_2 = 0.4\).
  • \(\pi_1 = 0.4, \pi_2 = 0.1\).

## [1] "The number of surrogate variable by asymptotic approach: 0"
## [1] "The number of surrogate variable by permutation test: 0"
## [1] "Mannually setting the number of surrogate variable as 1 to apply the sva algorithm"
## Number of significant surrogate variables is:  1 
## Iteration (out of 5 ):1  2  3  4  5
## [1] "The number of surrogate variable by asymptotic approach: 0"
## [1] "The number of surrogate variable: 0"
## [1] "Mannually setting the number of surrogate variable as 1 to apply the sva algorithm"
## Number of significant surrogate variables is:  1 
## Iteration (out of 5 ):1  2  3  4  5

PCA behavior

Real data example: Expression from a study of bladder cancer

This is an expression set of 57 samples drawn from a study of bladder cancer. The samples were collected on different dates, which are used to define 5 batches. These data are used as an example in the sva package vignette.

Here is a visualization of the expression data.

Apply SVA algorithm to detect the surrogate variable

## [1] "Number of surrogate variable by asymptotic approach: 2"
## [1] "Number of surrogate variable by permutation test: 9"

Residual Matrix Permuting Analysis

Permuting within rows as described in the literature

Permuting within rows and then within columns